Scaling multi-instance support vector machine to breast cancer detection on the BreaKHis dataset

Abstract
Motivation: Breast cancer is a type of cancer that develops in breast tissues, and, after skin cancer, it is the most commonly diagnosed cancer in women in the United States. Given that an early diagnosis is imperative to prevent breast cancer progression, many machine learning models have been developed in recent years to automate the histopathological classification of the different types of carcinomas. However, many of them do not scale to large datasets.
Results: In this study, we propose the novel Primal-Dual Multi-Instance Support Vector Machine to determine which tissue segments in an image exhibit an indication of an abnormality. We derive an efficient optimization algorithm for the proposed objective by bypassing the quadratic programming and least-squares problems, which are commonly employed to optimize Support Vector Machine models. The proposed method is computationally efficient and therefore scales to large datasets. We applied our method to the public BreaKHis dataset and achieved promising prediction performance and scalability for histopathological classification.
Availability and implementation: Software is publicly available at: https://1drv.ms/u/s!AiFpD21bgf2wgRLbQq08ixD0SgRD?e=OpqEmY.
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Every year, approximately 250 000 women in the United States are diagnosed with breast cancer (CDC, 2020). Differentiating between the different types of carcinomas (ductal, lobular, mucinous and papillary) is essential for making an accurate diagnosis. Histopathology, the examination of tissue sections with a microscope to aid in the diagnosis of illnesses such as cancer and inflammatory diseases, allows for this close examination, which leads to patients receiving a personalized treatment and can increase their likelihood of survival. When digitized, these tissue sections are also called whole-slide images (WSIs) or histopathological images. Traditionally, clinical disciplines such as radiology and pathology have relied heavily on specialized training to detect the presence of these diseases in histopathological images. A diagnosis is based on features exhibited by tissue samples at the cellular level. An anomaly in the cell architecture and the presence or absence of certain biological attributes can be strong indicators of a particular disease. For example, abnormal cells that divide uncontrollably, also known as carcinomas, lead to a cancer diagnosis when detected. A pathologist can detect this abnormal growth or tumor from a histopathological image and assess which regimen should be prescribed to halt the progression of the disease. This pattern analysis is an essential component of precision medicine, since it bases the diagnosis on patient-specific histopathological images.
Modern medical procedures and technologies have increased the number of biopsies performed, and consequently, the number of histopathological images collected has increased far beyond the reasonable workload of pathologists (van der Laak et al., 2021). This poses an impediment to precision medicine, since it requires the analysis of vast amounts of medical data. However, recent advancements in the field of artificial intelligence have shown promise in automating the analysis of histopathological images and improving the accuracy and speed of a diagnosis. Just as a pathologist finds patterns that help detect cellular abnormalities, algorithms can be used to extract features from an image such as pixel intensity (Hamilton et al., 2007), texture (Haralick, 1979) and Zernike moments (Khotanzad and Hong, 1990). The application of computational algorithms to diagnostic fields can aid pathologists in drawing accurate and precise conclusions in an efficient and reproducible manner (Gurcan et al., 2009).
In our research, we focus on developing a classification model for the public BreaKHis dataset (Spanhol et al., 2015), which is composed of 7909 histopathological images of different types of benign and malignant breast tumors. This dataset has been instrumental in our work, since its structure allows for extensive and precise classification of histopathological images. The dataset is split into benign and malignant categories, and these are further subdivided into different tumor types. The images in each tumor type group are acquired at four different magnification factors, and they are usually segmented into patches because of their large size. As a result, the classification problem is naturally formulated as a multi-instance learning (MIL) problem (Brand et al., 2021a,b; Wang et al., 2011) to determine which segments of tissue in an image exhibit an indication of an abnormality. MIL is an area of machine learning in which training and testing data are organized into sets of instances known as bags. MIL is a weakly supervised learning paradigm: labels are provided at the bag level instead of the instance level, so clinicians do not need to spend substantial resources characterizing each image in the training dataset obtained from a biopsy. Doctors only need to label the bag (patient) as malignant or benign, and the instances (histopathological images) follow suit. Despite being a very powerful approach, MIL remains challenging because many standard machine learning approaches rely on fixed-length vector inputs, which are not applicable to datasets with a varying number of instances per bag. At the same time, MIL models should be permutation invariant with respect to the instances of each input set: the prediction of the model should not be affected by the order of the instances.
© The Author(s) 2022. Published by Oxford University Press.
In our work, breast cancer histopathological images are represented by a bag (set) of patches, as illustrated in Figure 1. The bags, or images, are labeled as either malignant or benign while the instances, or patches, remain unlabeled (Brand et al., 2021a,b;Wang et al., 2011). Taking these facts into account, we propose the Primal-Dual Multi-Instance SVM (pdMISVM) method (Brand et al., 2021a), which improves the efficiency of optimization compared to the previously mentioned MIL approaches.
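The bag representation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the feature dimension, the random data and the helper `bag_score` are all hypothetical, and max pooling is used only as one simple example of an order-independent bag score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three bags with 1, 5 and 10 instances; every instance has d = 4 features.
bags = [rng.normal(size=(n_i, 4)) for n_i in (1, 5, 10)]
bag_labels = np.array([1, -1, 1])  # one label per bag, none per instance

def bag_score(bag, w, b):
    """Score a bag by its highest-scoring instance (max pooling)."""
    return float(np.max(bag @ w + b))

w = rng.normal(size=4)  # hypothetical hyperplane
b = 0.1

# Shuffling the instances of a bag leaves its score unchanged.
shuffled = bags[2][rng.permutation(len(bags[2]))]
same_score = np.isclose(bag_score(bags[2], w, b), bag_score(shuffled, w, b))
```

Because the bags have different lengths, they are kept in a list rather than stacked into one fixed-size array, which is exactly the situation that fixed-length models cannot handle directly.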

Related works
To ease the heavy workloads of pathologists, Computer Aided Diagnostics (CAD) has emerged to determine whether an image shows any indication of carcinoma and, if so, where the abnormality is located within the histopathological image. One widely used approach is to train Convolutional Neural Networks (CNNs) on patches extracted from WSIs. A CNN combines convolutional layers with subsequent fully connected layers, and its design is inspired by the receptive fields and neurons of the human eye and brain. Krizhevsky et al. (2012) showed that the deep structure of a CNN can achieve state-of-the-art performance in image recognition tasks. Their model, AlexNet, has been successfully applied to BreaKHis by Titoriya and Sachdeva (2019). However, CNN-based CAD still faces obstacles, because training a CNN requires a large amount of training data and imposes a heavy computational burden. These requirements make it difficult to integrate such predictive models into CAD systems. SVMs have also been studied as a practical alternative to deep learning models (Zheng et al., 2014). For example, an SVM with sparsity-inducing regularization (Kahya et al., 2017) can achieve an accuracy higher than 90% in image classification. Although SVM models have fewer trainable parameters than deep learning models, and therefore require less time and computational cost to train, their training time and computational complexity increase rapidly as the number of input features increases (Kumar and Rath, 2015; Peng et al., 2016). Another problem of traditional SVM models is that they are single-instance learning (SIL) models, i.e. they cannot handle a varying number of input instances, whereas WSIs are usually segmented into multiple patches (instances) because of their large size.
In light of the above issue, multiple instance learning is the better choice for disease detection applications, and algorithms of this type have previously been evaluated on the BreaKHis dataset. In a SIL model, classification becomes difficult when a single patch from an insignificant region of the image is given. MIL models, in contrast, enable correct classification from a few important patches, even if most patches do not include any indication of carcinoma. Several MIL methods have achieved satisfactory results on similar tasks, especially on the BreaKHis dataset (Sudharshan et al., 2019). These include the methods of Andrews et al. (2002), sparse Multi-Instance Learning (sMIL) and sparse balanced MIL (sbMIL) (Bunescu and Mooney, 2007), and Normalized Set Kernel (NSK) and Statistics Kernel (STK) (Gärtner et al., 2002). These are all methods that have been deemed successful at correctly labeling the bags in the testing dataset as either malignant or benign. However, despite the promising performance of MIL models, their scalability is rarely discussed, or they do not scale to large datasets. In addition, it is difficult to efficiently learn the hypothesis space of MIL models, which involves multiple instances (Wei et al., 2014). Unlike the existing algorithms, our approach scales well to larger datasets, which adds to its value for practical use.

Fig. 1. A visualization of our processing pipeline for our MIL algorithm applied to the BreaKHis dataset. We sample the patches (instances) from the histopathological images (bags) of four different magnification levels, and process the patches with the PFTAS method, which results in $d$-dimensional feature vectors of $n_i$ instances for each image. Finally, our multi-instance SVM classifies the bags as malignant or benign.

The paper organization
In the remainder of this manuscript, we present an objective and associated solution of the novel pdMISVM that extends to large-scale data. We derive the optimization algorithm based on the multi-block alternating direction method of multipliers (ADMM) (Hong and Luo, 2017) to bypass the quadratic programming problem that comes from the typical SVM and MISVM models. We further improve the ADMM derivation to decrease the complexity with respect to the large number of features. Lastly, we provide an application of the proposed method to classify the bag of patches and identify disease relevant patches, which can reduce the workload of pathologists.

Materials and data sources
We perform classifications on the publicly available BreaKHis dataset (https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis/). The BreaKHis dataset was built in collaboration with the P&D Laboratory in Parana, Brazil, and was first introduced in 'A Dataset for Breast Cancer Histopathological Image Classification' by Spanhol et al. (2015). The dataset comprises 7909 microscopic biopsy images of breast tumor tissue collected in a clinical study from January 2014 to December 2014. The dataset contains 2480 benign and 5429 malignant tissue samples. The images were collected using different magnifying factors (40×, 100×, 200× and 400×), and they are organized into these categories in the dataset. The samples were acquired from 82 patients whose data were anonymized. Samples were generated from breast tissue biopsy slides, stained with hematoxylin and eosin (HE) and collected by surgical open biopsy (SOB). They were labeled in the P&D Laboratory, and the diagnosis of each slide was determined by experienced pathologists (Spanhol et al., 2016). We segment the histopathological images into patches. In our experiments, each patch contains a 64 × 64 section of pixels and we extract a random number (sampled from {1, 5, 10}) of patches from each image. However, in the raw tissue segments, elements of interest such as nuclei may not be clearly visible. In light of this issue, we extract a feature vector from each patch through Parameter-Free Threshold Adjacency Statistics (PFTAS) (Hamilton et al., 2007). Based on the experimental results of a previous study (Spanhol et al., 2015), PFTAS outperforms other features such as Local Binary Patterns (LBP) (Ojala et al., 2002), Completed LBP (CLBP) (Guo et al., 2010), Local Phase Quantization (LPQ) (Ojansivu and Heikkilä, 2008) and Grey-Level Co-occurrence Matrix (GLCM) (Haralick et al., 1973) on the BreaKHis dataset.
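The patch sampling step can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' code: the function name `sample_patches`, the uniform choice of patch locations and the size of the stand-in image are hypothetical; only the 64 × 64 patch size and the set {1, 5, 10} come from the text.

```python
import numpy as np

def sample_patches(image, rng, patch_size=64, counts=(1, 5, 10)):
    """Return a random number of square patches cut from an H x W x 3 image."""
    height, width = image.shape[:2]
    n_patches = int(rng.choice(counts))          # bag size drawn from {1, 5, 10}
    patches = []
    for _ in range(n_patches):
        # Assumed: top-left corners are drawn uniformly at random.
        top = int(rng.integers(0, height - patch_size + 1))
        left = int(rng.integers(0, width - patch_size + 1))
        patches.append(image[top:top + patch_size, left:left + patch_size])
    return np.stack(patches)

rng = np.random.default_rng(0)
# Stand-in RGB image; the size is an assumption for illustration only.
image = rng.integers(0, 256, size=(460, 700, 3), dtype=np.uint8)
bag = sample_patches(image, rng)   # shape: (n_patches, 64, 64, 3)
```

Each call produces one bag whose length varies between images, which is what makes the downstream problem a multi-instance one.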
PFTAS is a method that extracts texture features by counting the number of black pixels in the neighborhood of a pixel. The total count for all the pixels in a given image is stored in a nine-bin histogram (Hamilton et al., 2007). The thresholding is done by Otsu's algorithm (Otsu, 1979) and the extractor returns a 162-dimensional feature vector. The 162 features consist of 3 channels (RGB) × 9 histogram bins × 3 thresholding ranges, concatenated with the bitwise negated version. Otsu's algorithm finds the optimal threshold value by maximizing the inter-class intensity variance. As a result, PFTAS features are robust against the varying mean of the intensity distribution of each RGB channel across images. To control the number of features, we concatenate several patches, so the final number of features is a multiple of 162. In our experiments, 7909 bags (images) are involved, of which 5429 are malignant and 2480 are benign.
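To make the thresholding step and the feature count concrete, the sketch below implements Otsu's criterion by exhaustive search over 8-bit thresholds and reproduces the 162-feature arithmetic. It is an illustration only: the function `otsu_threshold` and the toy intensity data are hypothetical, and this is not the PFTAS extractor used in the experiments.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the 8-bit threshold that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()     # class weights
        if w0 == 0.0 or w1 == 0.0:
            continue                                # one class is empty
        mu0 = (levels[:t] * prob[:t]).sum() / w0    # mean of the dark class
        mu1 = (levels[t:] * prob[t:]).sum() / w1    # mean of the bright class
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Two well-separated intensity clusters: the threshold falls between them.
gray = np.concatenate([np.full(500, 40), np.full(500, 200)]).astype(np.uint8)
t = otsu_threshold(gray)

# channels x histogram bins x threshold ranges x (original + negated)
n_features = 3 * 9 * 3 * 2
```

Because the threshold is derived from each image's own histogram, a global brightness shift moves the threshold with it, which is the robustness property mentioned above.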

Methods
In this section, we develop an objective for the scalable pdMISVM algorithm designed to handle multi-instance data. Our formulation for pdMISVM provides an efficient solution to avoid dependency on a quadratic programming or least-squares approach.

Notation
In this article, we denote matrices as $\mathbf{M}$, vectors as $\mathbf{m}$ and scalars as $m$. The $i$th row and $j$th column of $\mathbf{M}$ are $\mathbf{m}^i$ and $\mathbf{m}_j$, respectively. Similarly, $m^i_j$ is the scalar value indexed by the $i$th row and $j$th column of $\mathbf{M}$. The matrix $\mathbf{M}_p$ corresponds to the $p$th column-block of $\mathbf{M}$. Each bag $\mathbf{X}_i = \{\mathbf{x}^1_i, \dots, \mathbf{x}^{n_i}_i\}$ contains $n_i$ patches and its associated label for the $m$th class is represented by $y^m_i \in \{-1, 1\}$.

A primal-dual multi-instance support vector machine
The $K$-class multi-instance support vector machine (MISVM) was proposed by Andrews et al. (2002) and solves the objective in Equation (1), where $(\cdot)_+ = \max(\cdot, 0)$. Its decision function is given in Equation (2) and illustrated in Figure 2. The MISVM objective in Equation (1) is generally difficult to solve because the primal variables $\mathbf{w}_m$, $b_m$ are coupled by the $\max(\cdot)$ operations. Inspired by Nie et al. (2014) and Wang and Zhao (2017), we split the primal variables in Equation (1) via the ADMM approach by introducing the constraints in Equation (3). From Equation (3) we derive the augmented Lagrangian function in Equation (4), where $\mathbf{W}, \mathbf{b}, \mathbf{E}, \mathbf{Q}, \mathbf{T}, \mathbf{R}, \mathbf{U}$ are the primal variables, $\Lambda, \Sigma, \Theta, \Omega, \Xi$ are the dual variables, and $\mu > 0$ is a hyperparameter.
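The max-over-instances structure of the decision rule can be sketched as follows. This assumes the rule takes the form "pick the class whose hyperplane gives the largest response over the instances of the bag", which is how the rule is described in the text; the function `predict_bag` and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def predict_bag(X_i, W, b):
    """X_i: d x n_i instances, W: d x K hyperplanes, b: length-K biases."""
    responses = W.T @ X_i + b[:, None]      # K x n_i instance responses
    class_scores = responses.max(axis=1)    # strongest instance per class
    return int(np.argmax(class_scores)), class_scores

W = np.array([[1.0, -1.0],
              [0.0,  0.5]])                 # d = 2 features, K = 2 classes
b = np.array([0.0, 0.2])
X_i = np.array([[0.1, 2.0, -0.3],
                [0.4, 0.0,  1.0]])          # bag with n_i = 3 instances
label, scores = predict_bag(X_i, W, b)
```

The inner `max` is what couples all instances of a bag to a single class hyperplane, and it is this coupling that makes the objective hard to optimize directly.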

The solution algorithm
In this section, we derive an efficient solution algorithm to minimize the proposed objective in Equation (4). In Algorithm 1, we repeat the primal-dual updates until the gap in the constraints from the augmented Lagrangian terms in Equation (4) becomes smaller than a predefined tolerance. To keep the presentation focused and because of space limits, we only provide the derivation details for the class hyperplane, $\mathbf{w}_m$ and $b_m$, of each $m$th class in the main article, and leave the derivations for the other variables to the online Supplementary Appendix of this article.

b update. By differentiating Equation (7) element-wise with respect to $b_m$ and setting the result equal to zero, we have the update in Equation (5), where $i'$ indicates that the column blocks belonging to the $m$th class are chosen from $\mathbf{X}$. Taking the derivative of Equation (5) with respect to $b_m$, setting the derivative equal to zero, and solving for $b_m$ gives Equation (6), where $N'$ is the total number of patients belonging to the $m$th class.

W update (without kernel). We discard all terms in Equation (4) which do not include $\mathbf{W}$ and optimize the columns of $\mathbf{W}$ separately by solving the $K$ problems in Equation (7) for $m = 1, \dots, K$, where $N'$ is the number of bags which belong to the $m$th class, and $i'$ denotes the indices of the column blocks of $\mathbf{X}$ and the corresponding columns of $\mathbf{U}$ and $\Xi$. Finally, $\mathbf{t}^m_i$, $\boldsymbol{\theta}^m_i$, $\mathbf{u}^{m'}_{i'}$ and $\boldsymbol{\xi}^{m'}_{i'}$ are row vectors corresponding to the $i$th bag and $m$th class in $\mathbf{T}$, $\Theta$, $\mathbf{U}$ and $\Xi$. By letting the derivative of Equation (7) with respect to $\mathbf{w}_m$ equal zero, we attain the closed-form solution in Equation (8). In the calculation of Equation (8) we can avoid an inverse calculation through a least-squares solver.
W update (with kernel). The kernel method (Shawe-Taylor et al., 2004) is widely used in classification tasks to deal with non-linearity of the data. We provide a kernel extension of our method to learn the non-linear relationship between bag and target label. For an arbitrary (possibly non-linear) kernel function $\phi$, we map all the columns (instances) of $\mathbf{X}_i \in \mathbb{R}^{d \times n_i}$ to feature vectors $\phi(\mathbf{X}_i) = \Phi_i \in \mathbb{R}^{d_z \times n_i}$, and Equation (7) can be rewritten into Equation (9). We take the derivative with respect to $\mathbf{w}_m$ and set it equal to zero to solve for $\mathbf{w}_m$, which gives Equation (10), where $\Phi = [\Phi_1, \dots, \Phi_N] \in \mathbb{R}^{d_z \times N_t}$ and $\Phi' = [\Phi_{1'}, \dots, \Phi_{N'}] \in \mathbb{R}^{d_z \times N'_t}$. Here $N_t$ and $N'_t$ denote the total number of instances belonging to all classes and to the $m$th class, respectively, and $\Phi'$ contains the $N'$ column blocks of $\Phi$ corresponding to the $m$th class. However, the dimensionality $d_z$ of the feature vectors $\Phi$ of the kernel function can be very large (possibly infinite), thus calculating $(\mathbf{I}/\mu + \Phi\Phi^T + K\Phi'\Phi'^T)^{-1}$ in Equation (10) may not be computationally feasible. In order to derive a solution that scales for an arbitrary kernel function, we rewrite Equation (10) into the matrix form of Equation (11), which involves the term $(\mathbf{u}^{m'} - \mathbf{1}b_m + \boldsymbol{\xi}^{m'}/\mu)$, $\mathbf{D} = [\mathbf{I}, \mathbf{0}; \mathbf{0}, K\mathbf{I}]$ and $\hat{\Phi} = [\Phi, \Phi']$. Then we can apply the kernel trick (Welling, 2013) to Equation (11), which gives Equation (12).

Fig. 2. Illustration of the MISVM objective in Equation (1). In our model, each patch corresponds to an instance in a bag. We first calculate the distance from the hyperplane of each $m$th class to the farthest instance, which is the key instance triggering the bag label. By minimizing Equation (1), we optimize $\mathbf{W}$ and $\mathbf{b}$ of the hyperplanes so that the distance to the hyperplane of the correct class ($m = y$) is greater than the distance to the hyperplane of the incorrect class ($m \neq y$).

In Equation (12), we avoid computing the feature vectors $\Phi$ in the possibly large dimensionality $d_z$. Instead we only need to compute the inner products of feature vectors $\hat{\Phi}^T\hat{\Phi} \in \mathbb{R}^{(N_t + N'_t) \times (N_t + N'_t)}$, which is usually more efficient than directly computing $\Phi\Phi^T \in \mathbb{R}^{d_z \times d_z}$.
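The reason the inner products suffice can be checked numerically with the standard matrix identity $(\lambda \mathbf{I} + \Phi\Phi^T)^{-1}\Phi = \Phi(\lambda \mathbf{I} + \Phi^T\Phi)^{-1}$. The sketch below uses a simplified system with a single regularized block; it is an assumption-laden stand-in for the structure of Equation (10), not the paper's exact update, and all variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, n = 50, 8                      # many features, few instances
Phi = rng.normal(size=(d_z, n))     # columns are mapped instances
y = rng.normal(size=n)              # stand-in right-hand side
lam = 0.5

# Primal form: solve a d_z x d_z system.
w_primal = np.linalg.solve(lam * np.eye(d_z) + Phi @ Phi.T, Phi @ y)

# Dual form: solve an n x n system that only needs the Gram matrix Phi^T Phi.
gram = Phi.T @ Phi
w_dual = Phi @ np.linalg.solve(lam * np.eye(n) + gram, y)

forms_agree = np.allclose(w_primal, w_dual)
```

In the dual form the feature map enters only through `gram`, so a kernel function can supply those inner products without ever forming the $d_z$-dimensional vectors.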
The algorithm to solve the proposed objective in Equation (4) is summarized in Algorithm 1.
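The overall pattern of alternating primal updates, a dual ascent step, a growing penalty and a residual-based stopping rule can be sketched on a much simpler problem. The example below applies ADMM to the lasso, which is a swapped-in toy objective and not the pdMISVM objective in Equation (4); the function names, the unscaled dual variable `y` and the default values are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, c, lam=0.1, mu=1.0, rho=1.1, tol=1e-6, max_iter=500):
    """Minimize 0.5*||A x - c||^2 + lam*||z||_1 subject to x = z."""
    d = A.shape[1]
    x, z, y = np.zeros(d), np.zeros(d), np.zeros(d)   # y is the (unscaled) dual
    AtA, Atc = A.T @ A, A.T @ c
    residual = np.inf
    for _ in range(max_iter):
        x = np.linalg.solve(AtA + mu * np.eye(d), Atc - y + mu * z)  # primal block 1
        z = soft_threshold(x + y / mu, lam / mu)                     # primal block 2
        y = y + mu * (x - z)                                         # dual ascent
        residual = np.linalg.norm(x - z)                             # constraint gap
        if residual < tol:
            break
        mu *= rho                                                    # grow the penalty
    return x, z, residual

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
c = rng.normal(size=30)
x, z, residual = admm_lasso(A, c)
```

As in Algorithm 1, each block is updated with the others held fixed, and the loop stops once the gap in the splitting constraint falls below the tolerance.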

Avoiding calculations of the least-squares problems
As can be seen in Equation (8), the update for $\mathbf{w}_m$ relies on solving a least-squares problem in every iteration. However, the least-squares solver has complexity $O(Nd^2)$, which may not be computationally feasible if the number of features $d$ is very large. To avoid this problem, we instead use an optimal line search method (Nie et al., 2014) and update $\mathbf{w}_m$ via gradient descent as in Equation (13), where $\nabla_{\mathbf{w}_m}$ is the analytical gradient of Equation (4) with respect to $\mathbf{w}_m$, given in Equation (14). It can be used to define the minimization in Equation (15) in terms of $s_m$ instead of $\mathbf{w}_m$. Differentiating Equation (15) with respect to $s_m$ and setting the result equal to zero gives Equation (16). Finally, we plug Equations (14) and (16) into Equation (13) to obtain an efficient update that avoids the least-squares problem in Equation (8). The time complexity of the proposed method is $O(Ndn)$, where $n$ is the average number of instances per bag. The number of instances $n$ is typically smaller than the number of features $d$ (a multiple of 162 in our experiments), therefore our model with the solution in Equation (13)
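The idea of replacing a least-squares solve with gradient steps whose step size is found in closed form can be illustrated on ridge regression. This is a generic sketch on a swapped-in quadratic objective, not the gradient or step size of Equations (14) and (16); the function name `ridge_line_search` and the toy data are hypothetical.

```python
import numpy as np

def ridge_line_search(X, y, lam=1.0, n_iter=200):
    """Minimize 0.5*||X w - y||^2 + 0.5*lam*||w||^2 by steepest descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) + lam * w          # analytical gradient
        Xg = X @ grad
        denom = Xg @ Xg + lam * (grad @ grad)
        if denom == 0.0:                            # gradient vanished: optimum reached
            break
        s = (grad @ grad) / denom                   # optimal step along -grad
        w = w - s * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w_gd = ridge_line_search(X, y)
w_exact = np.linalg.solve(X.T @ X + np.eye(5), X.T @ y)   # closed form for comparison
```

Each iteration uses only matrix-vector products, so its cost grows linearly in the number of features, whereas forming and solving the normal equations grows quadratically.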

Results
In our experiments, we evaluate the classification performance and scalability of the proposed exact and inexact pdMISVM implementations. The scalability of pdMISVM is assessed across the increasing number of bags and features. Regarding the interpretability of our model, we also identify the disease relevant patches (instances) of each bag (image).

Benchmarks and hyperparameters
The classification performance and scalability of pdMISVM are compared against the following standard MIL benchmarks:
• A SIL method that assigns the bags' labels to all instances during training and produces the maximum response for each bag/class pair at testing time for the training bag's instances.

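Algorithm 1 The multi-block ADMM updates to optimize Equation (4).
Data: $\mathbf{X} \in \mathbb{R}^{d \times (n_1 + \dots + n_N)}$ and $\mathbf{Y} \in \{-1, 1\}^{K \times N}$.
Hyperparameters: $C > 0$, $\mu > 0$, $\rho > 1$ and tolerance $> 0$.
Initialize: primal variables $\mathbf{W}, \mathbf{b}, \mathbf{E}, \mathbf{Q}, \mathbf{R}, \mathbf{T}, \mathbf{U}$ and dual variables $\Lambda, \Sigma, \Theta, \Omega, \Xi$.
while residual > tolerance do
    for $m = 1, \dots, K$ do
        Update $\mathbf{w}_m \in \mathbf{W}$ by Eq. (13).
        Update $\mathbf{r}^m_p \in \mathbf{R}$.
        for $j \in n_p$ do
            Update $t^m_{p,j} \in \mathbf{T}$.
        end for
    end for
    Update $\mu = \rho\mu$.
end while
return $(\mathbf{w}_1, \dots, \mathbf{w}_K) \in \mathbf{W}$ and $(b_1, \dots, b_K) \in \mathbf{b}$.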
• The two bag-based methods, Normalized Set Kernel (NSK) and Statistics Kernel (STK) (Gärtner et al., 2002), which map the entire bag to a single instance by way of a kernel function.
• An iterated discrimination Axis-Parallel Rectangles algorithm (APR) (Dietterich et al., 1997): APR is a MIL model which starts from a single positive instance and grows the rectangle by expanding it to cover the remaining positive instances.
• The two multi-instance deep learning methods, mi-Net and MI-Net (Wang et al., 2018), which approach the MIL problem through the instance space and embedded space (learning a vectorial representation of the bag) paradigms, respectively.
• The two attention mechanism-based MIL models: Ilse et al. (2018) (AMIL) calculate a parameterized attention (importance) score for each instance to generate the probability distribution of bag labels. Shi et al. (2020) (LAMIL) propose to learn the instance scores and predictions jointly by integrating the attention mechanism with the loss function.
For these SVM models, the regularization tradeoff is set to 1.0. For the exact and inexact pdMISVM, the regularization tradeoff $C$ is set to 1e-3 and 1e+4 respectively, the tolerance is set to 1e-5 for both, and $\mu$ is initialized to 1e-10 and 1e-8 respectively. We use the radial basis kernel function for all SVM models (except the inexact pdMISVM, which uses a linear kernel). For the deep learning models (mi-Net, MI-Net, AMIL and LAMIL), we use the same hyperparameters as in their articles.

Classification performance
In this section, we evaluate the classification models to investigate whether our exact/inexact pdMISVM achieves better or comparable performance relative to the best performing classical or recent models. In Table 1, we report the performance of our pdMISVM compared against the other MIL algorithms in the classification of benign/malignant bags. For each model, we provide the precision, recall, F1-score, accuracy and balanced accuracy (BACC) across the 10 six-fold cross-validation experiments (six repetitions per experiment).
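For reference, the reported metrics can be computed from the confusion counts of a binary classifier as sketched below. The helper `binary_metrics` and the toy label vectors are hypothetical and serve only to show the definitions; they are not data from Table 1.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1, accuracy and balanced accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        return num / den if den else 0.0   # guard against empty classes

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)            # sensitivity
    specificity = ratio(tn, tn + fp)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, len(pairs)),
        "bacc": (recall + specificity) / 2,
    }

# Toy example: 4 malignant (1) and 2 benign (0) bags.
scores = binary_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
```

Balanced accuracy averages the recall of the two classes, which matters here because the dataset contains more than twice as many malignant as benign images.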
From the results reported in Table 1, the proposed exact/inexact pdMISVM show promising performance across the various magnification levels. In particular, our exact pdMISVM outperforms the other models in terms of recall. A high recall rate is critical in the medical domain, as false negatives may have serious consequences. This result shows the clinical utility of our model, as it is crucial not to miss a malignant tumor in the diagnosis. Compared to the MIL models, the SIL model performed the worst because it is difficult to accurately classify labels from individual patches. For example, evidence of malignancy may appear only in some patches of the bag. In this case, it is difficult to classify a patch as malignant when it shows no evidence of a malignant tumor. Our experimental results support the assumption that MIL models classify better than the SIL model. Interestingly, our exact pdMISVM performs better than the inexact version at the smaller magnification levels, while the opposite is observed at the larger magnification levels. These results show that the classification pattern of pdMISVM can vary depending on the choice of optimization approach, just as the optimization algorithm affects deep learning models (Wang et al., 2019). Although our derivation for the inexact pdMISVM does not obtain the exact optimal solution of the MISVM objective in Equation (1), our experimental results show that the inexact solution may improve the classification performance when compared to the exact solution. This is well supported by the previous finding (Chang et al., 2008) that some implementations of SVM achieve the highest accuracy before the objective reaches its minimum. Our exact/inexact pdMISVM also achieves an overall improvement in accuracy/BACC, which validates their usefulness in the field of MIL and the early detection of malignant tumors.

The scalability against bags and features
The main contribution of this study is that the derived Algorithm 1 scales to large datasets. In this timing experiment, our goal is to verify the analytical complexity calculated in Section 3.4 on a real-world dataset. We plot the training time of the classifiers on the BreaKHis dataset to verify this improved scalability against the number of bags in Figure 3 and the number of features in Figure 4. In this timing experiment, we use the linear kernel function for all SVM models. The deep learning models are excluded from this experiment because their training times exceed a reasonable limit (5 h). In Figure 3, the running time of NSK increases rapidly while the other models maintain a linear trend. Our pdMISVM outperforms the other models when training on a large number of bags. This result validates the superior scalability of the proposed primal-dual approach over the other SVM models, which rely on repeatedly solving a quadratic programming problem.
Although the initial derivation with Equation (8) scales well with respect to the number of bags, the update for $\mathbf{w}_m$ in Equation (8) requires solving a least-squares problem that scales quadratically as the number of features $d$ increases. To tackle this difficulty, we adopt an optimal line search method in Equation (13) to achieve linear complexity in the number of features. In Figure 4, we compare the training time of the exact/inexact versions of our model to the other competing models. Among all models, Ours-inexact and APR require the smallest training time when trained with a large number of features. Interestingly, the inexact variation of pdMISVM scales significantly better than the exact pdMISVM against the increasing number of features when the number of bags is fixed at 1000. This is well explained by the analytical complexity of the two derivations ($O(Nd^2)$ versus $O(Ndn)$) as discussed in Section 3.4.

Patch identification
Along with the improved prediction performance and scalability, our model pdMISVM can identify disease-relevant locations. Interpretability is crucial, as it can add confidence to the generated predictions and help clinicians use histopathological references to make a diagnosis. We calculate the patch-wise importance $\max(\mathbf{W}^T \mathbf{x}^j_i + \mathbf{b})$, which is the response of the $j$th patch to the decision function in Equation (2). Figures 5 and 6 show the identified patches in the benign and malignant images at the 400× magnification level. In Figures 5 and 6, the 10 boxes (patches) represent the 10 instances of each bag (image).
The patches identified by our model are in accordance with clinical insights. Morphologic abnormalities in the color, shape and size of the nuclei are regarded as the key characteristics that categorize a digitized biopsy as cancerous or non-cancerous (Rajbongshi et al., 2018). For example, in the third image in Figure 5, our model highlights the regions containing the cells' nuclei. From the identified patches, our model can reveal that the nuclear to cell volume ratio is consistent throughout, which is a distinctive feature of non-carcinoma (Jevtić and Levy, 2014). Because of this, our model correctly classifies the bag as benign. A previous study explains that a disorganized arrangement of cells is one of the characteristics of cancerous cells. In the second image in Figure 5, our model identifies a continuous, organized distribution of cells, which is another indication that our model was correct in labeling this image as benign. For the three malignant samples in Figure 6, our model focuses on the variation in the size and shape of nuclei. Based on the literature (Fischer, 2020), the loss of normal morphology and the large/varying shape of nuclei are essential for the diagnosis of malignancy in the practice of surgical pathology. The accurately identified regions validate the correctness of our model in histopathological image classification and add value to its clinical practicability.

Discussion
We demonstrated that the MIL SVM can detect malignancy in the patches. With the development of image acquisition technology, it has become crucial to train models with a large number of images to improve classification performance. Accordingly, scalability has emerged as a major issue, and improved scalability can increase the performance and decrease the cost of meeting the processing demands of CAD systems. Therefore, this study proposes a new optimization method for SVM with improved scalability. The proposed method reduces the computational complexity with respect to the large number of instance features by approximating the optimal point of the SVM; nevertheless, the experimental results show that the classification performance of the SVM is not sacrificed, and is even improved in certain cases (in the lower magnification levels). In addition, the permutation invariance property is satisfied in Equation (1), which is desirable in MIL. The proposed optimization method can be applied regardless of whether a kernel function is used; however, we plan to address the scalability of the kernelized SVM in a future study. In this study, we sampled the patches of WSIs at random locations, and we plan to integrate an attention mechanism to automatically sample the patches important for malignant tumor detection. We propose a general framework for MIL, and other models stemming from our approach can be flexibly applied to various MIL problems.

Conclusion
The improvement of the scalability of methods is attracting more attention from machine learning studies as the amount of available data is increasing due to the development of data mining technologies. In this work, we present a novel Primal-Dual Multi-Instance SVM method and the associated derivations, which scale to a large number of bags and features. We have conducted extensive experiments on the BreaKHis dataset to show the promising performance and scalability of the proposed method when compared to the traditional SVM-based MIL techniques. In addition to the improved classification performance and scalability, the key patches for the classification identified by our model are well supported by previous medical studies. The experimental results illustrate the clinical utility of our approach on the detection of cancerous abnormalities in a large dataset to prevent the progression of breast cancer in a patient.

Funding
This work was supported in part by the National Science Foundation (NSF) under the grants of Information and Intelligent Systems (IIS) [1652943,1849359], Computer and Network Systems (CNS) [1932482].
Conflict of Interest: none declared.